It is well known to request a web page according to a Uniform Resource Locator (URL). More particularly, when a web page corresponding to a particular URL is requested using an HTTP request, a service (e.g., a web site) associated with that URL provides, in response, web page content nominally associated with that URL. When the destination server for the service cannot fulfill the request, an HTTP standard error response code “404” may be returned. Typically, the client application (e.g., browser) interprets the 404 response code and provides a default message generated by the client application. URLs that lead to such 404 response codes are sometimes known as “dead links.”
More recently, the services themselves (in contrast to the client applications) have taken to providing an HTTP standard response code 200 (OK code), as opposed to an HTTP standard error response code 404, even when a web page corresponding to a dead link URL is requested. In addition, these services generally provide a default page that indicates (to a human user) that the client application requested a web page corresponding to a dead link. This type of response has become known as a “soft 404.”
To an automated process, the default page indicating that the requested web page corresponds to a dead link generally appears to be responsive and not indicative of a dead link. Thus, for example, unless a search engine accounts for it, the search engine may treat a web page corresponding to a soft 404-type dead link as an actual live web page, which can affect search results in what are probably unintended ways. Types of processing other than processing used by search engines can also benefit from knowledge of soft 404-type dead links.
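By way of illustration only, the following minimal sketch shows why an automated process that relies solely on the HTTP response code may treat a soft 404 as a live page. The `HttpResponse` type, the `looks_live_by_status` function, and the sample page bodies are hypothetical and are introduced here only for the example.

```python
from dataclasses import dataclass


@dataclass
class HttpResponse:
    """Hypothetical, simplified view of an HTTP response."""
    status: int
    body: str


def looks_live_by_status(response: HttpResponse) -> bool:
    """Naive check: treat any response with status 200 as a live page."""
    return response.status == 200


# A conventional dead link: the server reports the error in the status code.
hard_404 = HttpResponse(404, "Not Found")

# A soft 404: the status code says OK, but the body is an error page.
soft_404 = HttpResponse(
    200, "<html><body>Sorry, that page does not exist.</body></html>"
)

# A page that actually exists.
live_page = HttpResponse(200, "<html><body>Product catalog</body></html>")

for name, response in [
    ("hard 404", hard_404),
    ("soft 404", soft_404),
    ("live page", live_page),
]:
    print(name, "-> treated as live:", looks_live_by_status(response))
```

Under this naive check, the soft 404 response is indistinguishable from the live page, because the only signal of the dead link is in the human-readable body.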
According to one article, “Sic transit gloria telae: towards an understanding of the web's decay”, by Z. Bar-Yossef et al. (2004), it is estimated that soft 404s account for more than twenty-five percent of the dead links on the web. The Bar-Yossef article proposes a method to detect whether a particular web page is a soft 404 page.
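The details of the article's method are not reproduced here. Purely as an illustrative assumption, one generic way to approach such detection is to request a second URL on the same host that almost certainly does not exist and to compare the two responses. In the sketch below, the `fetch` callable, the function names, the similarity threshold of 0.9, and the use of `difflib` for comparison are all hypothetical choices for the example, not features of the cited article.

```python
import uuid
from difflib import SequenceMatcher
from typing import Callable, Tuple
from urllib.parse import urljoin

# Hypothetical fetcher: takes a URL, returns (status_code, body_text).
Fetch = Callable[[str], Tuple[int, str]]


def random_sibling_url(url: str) -> str:
    """Build a URL next to `url` that almost certainly does not exist."""
    return urljoin(url, uuid.uuid4().hex + ".html")


def is_probable_soft_404(url: str, fetch: Fetch, threshold: float = 0.9) -> bool:
    """Heuristic sketch: flag `url` if it looks like the host's error page."""
    status, body = fetch(url)
    if status != 200:
        return False  # not an OK response, so not a soft 404 in this sketch

    probe_status, probe_body = fetch(random_sibling_url(url))
    if probe_status != 200:
        return False  # the host reports missing pages in the status code

    # Compare the page with the response for the known-missing URL.
    matcher = SequenceMatcher(None, body, probe_body, autojunk=False)
    return matcher.ratio() >= threshold


# Usage with a stand-in fetcher, so the sketch runs without network access.
ERROR_PAGE = "<html><body>Sorry, page not found.</body></html>"
LIVE_URL = "http://example.com/live.html"
LIVE_BODY = "<html><body>" + "Real catalog content. " * 20 + "</body></html>"


def fake_fetch(url: str) -> Tuple[int, str]:
    """Simulated service that answers 200 for every URL, existing or not."""
    return (200, LIVE_BODY if url == LIVE_URL else ERROR_PAGE)


print(is_probable_soft_404("http://example.com/gone.html", fake_fetch))
print(is_probable_soft_404(LIVE_URL, fake_fetch))
```

A heuristic of this kind costs one additional request per checked URL, and the threshold would need tuning for services whose error pages vary from request to request.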